In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set(font_scale=2)
sns.set_style("whitegrid")

Comparing countries with PCA

Now that we've looked at positions for general players, we can try to compare the players from two different countries. This may allow us to predict the winner of a match between two countries, for example during the World Cup.

We'll first compare Brazil, a perennial powerhouse, and Japan, a relative newcomer to professional football. We'll construct two datasets, one with goal-keepers, and one with "regular" players.

In [2]:
df = pd.read_csv("FIFA_2018.csv", encoding="ISO-8859-1", index_col=0, low_memory=False)
In [3]:
country_1 = 'Brazil'
country_2 = 'Japan'

D = df[df['Nationality'].isin([country_1, country_2])].copy()

D.head()
Out[3]:
Acceleration Aggression Agility Balance Ball control Composure Crossing Curve Dribbling Finishing ... Sprint speed Stamina Standing tackle Strength Vision Volleys Position Name Nationality Club
2 94 56 96 82 95 92 75 81 96 89 ... 90 78 24 53 80 83 FWD Neymar Brazil Paris Saint-Germain
30 70 77 74 68 80 83 60 61 68 38 ... 74 74 89 81 74 63 DEF Thiago Silva Brazil Paris Saint-Germain
39 77 84 77 82 88 85 90 80 84 67 ... 79 81 85 77 75 54 DEF Marcelo Brazil Real Madrid CF
51 84 82 79 79 81 82 86 78 82 55 ... 88 93 84 80 70 68 MID Alex Sandro Brazil Juventus
54 88 55 92 92 88 85 77 84 88 74 ... 77 80 44 61 87 75 MID Coutinho Brazil Liverpool

5 rows × 38 columns

Construct two datasets, one with goal-keepers (name it D_gk), and one with "regular" players (name it D_reg). The dataset with regular players should have no goal-keeping statistics, and vice versa.

In [4]:
# clear
# Goalkeepers: keep only the five goalkeeping attributes plus Nationality
D_gk = D[D['Position'] == 'GK'].copy()
D_gk = D_gk[['GK diving', 'GK handling', 'GK kicking', 'GK positioning', 'GK reflexes',
             'Nationality']]

# Regular players: drop the goalkeeping attributes
D_reg = D[D['Position'] != 'GK'].copy()
D_reg = D_reg.drop(columns=['GK diving', 'GK handling', 'GK kicking',
                            'GK positioning', 'GK reflexes'])

Now we can once again subtract the column means, compute the SVD, and add the first two principal components as columns in the dataframes.

In [5]:
# clear
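# Attribute matrices: drop the trailing label columns
# (Position, Name, Nationality, Club for D_reg; Nationality for D_gk)
X_reg = D_reg.iloc[:, :-4].copy()
X_gk = D_gk.iloc[:, :-1].copy()

# Center each attribute (column) at zero
A = X_reg - X_reg.mean()
B = X_gk - X_gk.mean()

# Thin SVD; the columns of V and v are the principal directions
U, S, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt.T

u, s, vt = np.linalg.svd(B, full_matrices=False)
v = vt.T

# Principal component scores: k-th column of U scaled by the k-th singular value
D_reg['pc1'] = U[:, 0] * S[0]
D_reg['pc2'] = U[:, 1] * S[1]

D_gk['pc1'] = u[:, 0] * s[0]
D_gk['pc2'] = u[:, 1] * s[1]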
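
The pc1 and pc2 columns are the coordinates of each centered row along the first two principal directions. Since A = U S Vt, multiplying by V gives A V = U S, so scaling the columns of U by the singular values is the same as projecting A onto the columns of V. A minimal sketch on synthetic data checks this; the array X_toy, its shape, and the random seed are made up for illustration and are not part of the FIFA dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(20, 5))      # hypothetical data: 20 players, 5 attributes

A_toy = X_toy - X_toy.mean(axis=0)    # center each column
U_toy, S_toy, Vt_toy = np.linalg.svd(A_toy, full_matrices=False)
V_toy = Vt_toy.T

scores_from_U = U_toy[:, :2] * S_toy[:2]   # what the notebook stores as pc1, pc2
scores_from_V = A_toy @ V_toy[:, :2]       # projection onto the first two directions

# The two ways of computing the scores should agree up to rounding error
print(np.allclose(scores_from_U, scores_from_V))
```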

We'll first compare the goalkeepers by plotting the first two principal components (use the same lmplot code snippet from part 1). Since there are only five goalkeeper attributes, we can plot all of them and see how the two countries stack up.
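
One common way to show attributes on a PCA scatter is to draw each attribute as an arrow whose tip is given by that attribute's entries in the first two principal directions. The sketch below assumes that reading of "plotting the attributes"; the helper name attribute_arrows and its scale argument are hypothetical and are not taken from part 1, and the data is random.

```python
import numpy as np

def attribute_arrows(V, names, idx, scale=1.0):
    """Return {attribute name: (x, y)} arrow tips in the pc1/pc2 plane.

    V     : matrix whose columns are principal directions (Vt.T from the SVD)
    names : attribute names, in the same order as the rows of V
    idx   : positions of the attributes to draw
    scale : hypothetical stretch factor so arrows are visible next to the scores
    """
    return {names[j]: (scale * V[j, 0], scale * V[j, 1]) for j in idx}

# Toy check using the five goalkeeper attribute names from the notebook
gk_names = ['GK diving', 'GK handling', 'GK kicking', 'GK positioning', 'GK reflexes']
rng = np.random.default_rng(1)
B_toy = rng.normal(size=(12, 5))          # made-up ratings, 12 goalkeepers
B_toy = B_toy - B_toy.mean(axis=0)        # center each column
_, _, vt_toy = np.linalg.svd(B_toy, full_matrices=False)

arrows = attribute_arrows(vt_toy.T, gk_names, range(5), scale=10.0)
for name, (x, y) in arrows.items():
    print(f"{name}: ({x:.2f}, {y:.2f})")
```

Each tip could then be drawn on top of the scatter with something like plt.arrow(0, 0, x, y) and labeled with the attribute name.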

It appears that Brazilian goalkeepers have a clear advantage in handling, positioning, reflexes, and diving. Kicking is a little more even, but it still looks like Brazil has an advantage.

Furthermore, the best Brazilian goal-keepers seem to be much better than the best Japanese goal-keepers.

Now let's compare the forwards. You will first need to extract the rows of D_reg for which D_reg['Position']=='FWD'. Then plot the first two principal components and the projections for attributes [2, 9, 19, 21, 24].
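
The filtering step and the lookup of attribute names by position can be sketched on a miniature table. Everything in D_toy below (the ratings, the four rows, the reduced set of columns) is made up for illustration; in the notebook the same pattern would apply to D_reg and the column order of X_reg.

```python
import pandas as pd

# Hypothetical miniature of D_reg: three attributes plus two trailing label columns
D_toy = pd.DataFrame({
    'Acceleration': [90, 60, 70, 80],
    'Aggression':   [50, 85, 75, 55],
    'Agility':      [95, 65, 70, 85],
    'Position':     ['FWD', 'DEF', 'DEF', 'FWD'],
    'Nationality':  ['Brazil', 'Brazil', 'Japan', 'Japan'],
})

# Keep only the forwards; any columns already on the frame come along unchanged
D_fwd = D_toy[D_toy['Position'] == 'FWD']

# Attribute columns are everything before the label columns;
# integer positions then pick out the attributes to draw
attr_cols = D_toy.columns[:-2]
chosen = attr_cols[[0, 2]].tolist()

print(chosen)
print(D_fwd[chosen + ['Nationality']])
```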

From here, it looks like Japan has many more below-average forwards than Brazil. Nearly all Japanese forwards have below-average stamina and reactions, and Brazilian forwards are more likely to have stronger finishing, agility, and shot power.

Next, compare the midfielders. First extract the rows of D_reg for which D_reg['Position']=='MID'. Then plot the first two principal components and the projections for attributes [8, 12, 14, 18, 24].

Again, it seems that Brazilian midfielders are more skilled. Japanese midfielders who are strong defensively (above average in interceptions) tend to be below average in other skills, whereas their defensive-minded Brazilian counterparts are not.

Last, compare the defenders. First extract the rows of D_reg for which D_reg['Position']=='DEF'. Then plot the first two principal components and the projections for attributes [1, 12, 14, 22, 24, 26].

Finally, Japanese defenders trail Brazilian defenders in important defensive attributes such as interceptions, sliding tackles, aggression, and long passing.

It seems clear that Brazil tends to have more skilled football players than Japan, which is unsurprising given Brazil's decades of dominance in the sport. Although the two teams did not meet in the 2018 World Cup, Brazil finished with a better record and advanced further in the knockout bracket, as these plots would suggest.

You can now repeat the analysis for any two countries of your choice! Can principal component analysis explain any of the results from the last World Cup? That is, was it obvious beforehand that France would beat Croatia in the final match? Are any of the results surprising?
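
To repeat the steps for another pair of countries, it may help to wrap them in a function. The sketch below is one possible arrangement, not the notebook's own code: the names pca_scores and compare_countries are hypothetical, the ratings are random, and the goalkeeper/regular-player split is left out for brevity.

```python
import numpy as np
import pandas as pd

def pca_scores(X, n_components=2):
    """Center the columns of X and return (scores, V) from the thin SVD."""
    A = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :n_components] * S[:n_components], Vt.T

def compare_countries(df, country_1, country_2, attr_cols):
    """Hypothetical wrapper: keep two nationalities and attach pc1/pc2 columns."""
    D = df[df['Nationality'].isin([country_1, country_2])].copy()
    scores, V = pca_scores(D[attr_cols].to_numpy(dtype=float))
    D['pc1'], D['pc2'] = scores[:, 0], scores[:, 1]
    return D, V

# Toy data with made-up ratings; the notebook would pass df read from FIFA_2018.csv
attrs = ['Acceleration', 'Aggression', 'Agility', 'Balance']
rng = np.random.default_rng(2)
toy = pd.DataFrame(rng.integers(40, 95, size=(30, 4)), columns=attrs)
toy['Nationality'] = ['France', 'Croatia', 'Spain'] * 10

D_fc, V_fc = compare_countries(toy, 'France', 'Croatia', attrs)

# Average position of each country along the first two components
print(D_fc.groupby('Nationality')[['pc1', 'pc2']].mean())
```

Returning V alongside the dataframe keeps the principal directions available for drawing attribute projections on the same plot.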